Abstract

Background: CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis 
predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of 
similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%.
CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein 
sequence space. 

Results: The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The 
structure revealed two extra domains: an α+β domain inserted in the metallopeptidase-like domain and a
C-terminal circularly permuted winged-helix-turn-helix domain. 

Conclusions: Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a 
view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we 
propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a 
modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide 
the groundwork for the experimental verification of the function.
Background

CA_C2195 from Clostridium acetobutylicum [UniProtKB: 
Q97H19_CLOAB] is a novel 434-residue protein of un- 
known function. Initial sequence analysis suggested that 
this protein could be a metallopeptidase. A PSI-BLAST 
[1] search against UniProt revealed that there are over 200 
other similar proteins of unknown function. Pairwise se- 
quence identities of these proteins to CA_C2195 vary be- 
tween 40-60%. We present here the crystal structure of 
CA_C2195, determined as part of the Protein Structure
Initiative program to extend structural coverage of novel
protein sequence space to provide structure-based func-
tion assignment [2,3]. CA_C2195 was specifically targeted
by the Joint Center for Structural Genomics (JCSG) in an 
effort to increase the structural coverage of proteins in 
Pfam [4] clan CL0035 of metallopeptidases (Peptidase 
MH/MC/MF), which has ~64,000 protein sequences (in-
cluding CA_C2195) in 12 families (Pfam v27.0, March
2013) but with only limited (~0.2%), biased structural
coverage. The families that form this clan contain many 
sequences, are functionally diverse, and are important in 
numerous biological processes. For example, recombinant 
bacterial carboxypeptidase G2 is used in cancer therapy to
hydrolyze methotrexate [5] and is being tested in prodrug
therapy; and human aspartoacylase is implicated in Cana-
van disease in the brain [6]. There are also non-peptidase
homologs of these proteins: some of these have active 
catalytic domains, but perform distinct albeit related en- 
zymatic functions, such as the glutaminyl-peptide cyclo- 
transferase. In other cases the homologous domains are 
not catalytically active and they perform protein-protein 
interaction based functions, such as the transferrin recep-
tor proteins 1 and 2. JCSG has determined ~20 structures
to date from clan CL0035 (see http://www.topsan.org/ 
Groups/Zinc_Peptidase). Proteins in these families [7,8] 
have a broad phylogenetic spread across all kingdoms of 
life and show substantial sequence divergence. 

The structure of CA_C2195 revealed that it is com- 
posed of three domains. Our sequence and structure 
analysis led to the assignment of these three domains of 
CA_C2195 and its homologs to new Pfam families 
(using standard Pfam protocols) [4], to be released in 
the next Pfam update, version 28.0: the N-terminal 
metallopeptidase-like domain to DUF4910 (Domain of 
Unknown Function, [Pfam:PF16254]), which is distantly 
related by sequence to the Peptidase_M28 family [Pfam: 
PF04389] in clan CL0035 (MEROPS [9] M28 family in 
the peptidase MH clan); the insert domain to DUF2172 
[Pfam:PF09940] (a reassignment of the existing entry); 
and the C-terminal wHTH to HTH_47 [Pfam:PF16221]. 
We believe that our results may aid in the design of 
structure-based biochemical experiments to further ex- 
plore the biology of these proteins similar to other re- 
cent efforts on proteins of unknown function [10-15]. 
Based on a recent study, many DUF proteins are likely 
essential proteins [16]. 

Results and discussion 

Overall structure 

The protein production and crystallization of CA_C2195 
was performed by standard protocols in the JCSG High- 
Throughput Structural Biology pipeline (www.jcsg.org) as 
briefly described in Methods. The crystal structure was 
determined to 2.37 Å by Multi-wavelength Anomalous
Diffraction (MAD) phasing and atomic coordinates and 
experimental structure factors have been deposited in the 
Protein Data Bank (www.wwpdb.org) with PDB accession 
code 3k9t. Data collection, model and refinement statistics 
are summarized in Table 1 [17-20]. There is one molecule 
of CA_C2195 in the crystallographic asymmetric unit 
(Figure 1), which contains 422 of the 434 residues in the 
entire protein as well as Gly0 that remains after cleavage
of the protein expression and purification tag. Residues 
374-386 were disordered in the structure and were ex-
cluded from the protein model. A zinc ion (Zn) was mod-
eled at the putative peptidase active site based on its
presence in the crystallization condition as well as an
anomalous difference Fourier map. An imidazole mol-
ecule (Imd) from the crystallization condition was also
modeled based on electron density to coordinate with 
the Zn. Other solvent molecules include two chloride 
ions and four (4R)-2-methylpentane-2,4-diol (MRD) 
molecules from the crystallization condition as well as 
water molecules. Sequencing of the cloned construct 
indicated that residue Pro309 was substituted with a 
serine residue, which was supported by electron dens- 
ity. Based on crystal packing analysis, using the 'Protein 
interfaces, surfaces and assemblies' service PISA (www. 
ebi.ac.uk/pdbe/prot_int/pistart.html) [21] at the European 
Bioinformatics Institute (EBI), the predicted biological
assembly of CA_C2195 is a trimer. Size-exclusion chro- 
matography coupled with static light scattering, per- 
formed during protein production and crystallization 
screening, also supports a protein trimer in solution. A 
search for other proteins that may share overall struc-
tural similarity to CA_C2195, using the Protein struc-
ture comparison service Fold at EBI (www.ebi.ac.uk/
msd-srv/ssm) [22] produced no significant hits. Exam- 
ination of the structure revealed three distinct domains: 
a Peptidase_M28-like metallopeptidase domain with a
small α+β domain inserted into it and a C-terminal
wHTH domain [23,24]. 
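
As a bookkeeping aid for the three-domain architecture, the following minimal sketch maps a residue number to its domain. The approximate boundaries (1-55 and 165-355 for DUF4910, 56-164 for DUF2172, 356-434 for HTH_47) are those given in this paper; the names `DOMAINS`, `domain_of` and `coverage` are hypothetical and introduced here only for illustration.

```python
from typing import List, Tuple

# Approximate domain boundaries as stated in the paper (inclusive residue numbers).
# The metallopeptidase-like domain is discontinuous because DUF2172 is inserted into it.
DOMAINS: List[Tuple[str, List[Tuple[int, int]]]] = [
    ("DUF4910", [(1, 55), (165, 355)]),  # N-terminal metallopeptidase-like domain
    ("DUF2172", [(56, 164)]),            # insert domain
    ("HTH_47", [(356, 434)]),            # C-terminal wHTH domain
]


def domain_of(residue: int) -> str:
    """Return the name of the domain containing `residue`, or 'unassigned'."""
    for name, segments in DOMAINS:
        if any(start <= residue <= end for start, end in segments):
            return name
    return "unassigned"


def coverage() -> int:
    """Total number of residues covered by all domain segments."""
    return sum(end - start + 1 for _, segs in DOMAINS for start, end in segs)


if __name__ == "__main__":
    # Zn-coordinating residues reported for the putative active site.
    for res in (189, 195, 324):
        print(res, domain_of(res))
    print("residues covered:", coverage())
```

With these boundaries the three Zn-coordinating residues all fall in DUF4910, and the segments together account for all 434 residues of the sequence.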

N-terminal metallopeptidase-like domain (DUF4910)

Out of the 434 residues in CA_C2195, approximately resi- 
dues 1-55 and 165-355 form the metallopeptidase-like 
domain, forming the portion that is related to the Peptida- 
se_M28 family [Pfam:PF04389]. A search for other struc- 
turally related proteins using Fold produces significant 
hits to several aminopeptidases (SSM Q-score ~0.4, root-
mean-square deviation (r.m.s.d.) ~2.3 Å between Cα atoms
over the entire domain) with PDB codes [PDB:2dea]
(Figure 2), [PDB:1rtq], [PDB:2iq6] and [PDB:3b3t], all
structures from the Peptidase_M28 family. However, des-
pite the degree of structural conservation, the level of se-
quence identity is very low (~17%). The putative active
site includes a Zn coordinated by residues Asp195,
His189, His324 and the N3 atom from the Imd. It is pos-
sible that Imd mimics a portion of the physiological lig-
and. To identify conserved residues and any potential
clustering of such residues, we aligned 82 homologs (ran- 
ging from 35-60% sequence identity) and used the conser- 
vation profile to mark-up the structure corresponding to 
DUF4910 (Figure 3). This sequence conservation analysis 
identified a cluster of conserved residues located within a 
cleft of the structure, which include Asp195, His189 and
His324 that coordinate to the Zn, and together form a pu- 
tative active site. 
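
To make the idea of a per-position conservation profile concrete, the sketch below scores each column of a multiple sequence alignment. It is not the CONSURF procedure used in this work, which estimates evolutionary rates with a Bayesian method on a phylogeny; it substitutes a much simpler Shannon-entropy score as a stand-in. The function name `column_conservation` and the three-sequence toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Score each alignment column from 0 (variable) to 1 (invariant).

    Score = 1 - H / log2(20), where H is the Shannon entropy of the residue
    frequencies in the column. Gap characters ('-') are ignored.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)  # entropy of 20 equally likely amino acids
    scores: List[float] = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no information
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 1 and 3 invariant, column 2 variable.
    toy = ["HSD", "HAD", "HGD"]
    for position, score in enumerate(column_conservation(toy), start=1):
        print(position, round(score, 3))
```

Mapping such scores onto the structure, as done with the CONSURF grades in Figure 3, is what reveals spatial clusters of conserved residues that are not contiguous in sequence.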

All known Peptidase_M28 members bind two Zn ions, 
which are described as "co-catalytic" as both Zn ions par- 
ticipate in the catalytic activity. In contrast, CA_C2195 

Table 1 Summary of crystal parameters, data collection and refinement statistics for PDB 3k9t

                                   λ1 MAD-Se        λ2 MAD-Se        λ3 MAD-Se
Data collection
Space group                        H32
Unit cell parameters (Å)           a = 153.78, b = 153.78, c = 168.38
Wavelength (Å)                     0.91837          0.97925          0.97911
Resolution range (Å)               29.1-2.37        29.1-2.44        29.1-2.25
                                   (2.43-2.37)      (2.50-2.44)      (2.31-2.25)
No. of observations                172,585          157,212          403,378
No. of unique reflections          31,178           28,543           36,347
Completeness (%)                   99.9 (100.0)     99.9 (100.0)     100.0 (100.0)
Mean I/σ(I)                        9.0 (1.5)        9.2 (1.6)        12.7 (1.9)
Rmerge on I† (%)                   18.9 (101.7)     18.3 (93.1)      20.9 (132.1)
Rmeas on I‡ (%)                    20.9 (112.3)     20.3 (102.9)     21.9 (138.4)
Rpim on I§ (%)                     8.8 (47.4)       8.6 (43.4)       6.5 (41.2)

Model and refinement statistics
Resolution range (Å)               29.1-2.37
No. of reflections (total)         31,177¶
No. of reflections (test)          1576
Completeness (%)                   100.0
Data set used in refinement        λ1
Cutoff criteria                    |F| > 0
Rcryst**                           0.171
Rfree**                            0.212

Stereochemical parameters
Restraints (RMSD observed)
Bond angles (°)                    1.61
Bond lengths (Å)                   0.015
Average isotropic B value†† (Å²)   29.5
ESU*** based on Rfree (Å)          0.18
Protein residues / atoms           422 / 3386
Waters / Zn / Cl / Imd / MRD       221 / 1 / 2 / 1 / 4

Values in parentheses are for the highest resolution shell.

† $R_{merge} = \sum_{hkl}\sum_{i} |I_i(hkl) - \langle I(hkl)\rangle| / \sum_{hkl}\sum_{i} I_i(hkl)$.

‡ $R_{meas} = \sum_{hkl} [N/(N-1)]^{1/2} \sum_{i} |I_i(hkl) - \langle I(hkl)\rangle| / \sum_{hkl}\sum_{i} I_i(hkl)$.

§ $R_{pim}$ (precision-indicating $R_{merge}$) $= \sum_{hkl} [1/(N-1)]^{1/2} \sum_{i} |I_i(hkl) - \langle I(hkl)\rangle| / \sum_{hkl}\sum_{i} I_i(hkl)$ [19,20].

¶ Typically, the number of unique reflections used in refinement is slightly less than the total number that were integrated and scaled. Reflections are excluded
owing to systematic absences, negative intensities and rounding errors in the resolution limits and unit-cell parameters.

** $R_{cryst} = \sum_{hkl} ||F_{obs}| - |F_{calc}|| / \sum_{hkl} |F_{obs}|$, where $F_{calc}$ and $F_{obs}$ are the calculated and observed structure-factor amplitudes, respectively. Rfree is the same as Rcryst but for
5.1% of the total reflections chosen at random and omitted from refinement.

†† This value represents the total B that includes TLS and residual B components.

*** Estimated overall coordinate error [18].
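
The merging statistics defined in the footnotes to Table 1 can be illustrated with a short calculation. The sketch below is not the SCALA implementation used for the reported values; it only evaluates the three formulas on a toy set of repeated intensity measurements, taking N as the number of observations of a given reflection and skipping reflections measured only once. The function name `merging_r_factors` and the toy intensities are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Map from Miller index (h, k, l) to the list of intensities observed for it.
Observations = Dict[Tuple[int, int, int], List[float]]


def merging_r_factors(observations: Observations) -> Tuple[float, float, float]:
    """Return (Rmerge, Rmeas, Rpim) as fractions, not percentages.

    Reflections with fewer than two observations are skipped because the
    N/(N-1) and 1/(N-1) weights are undefined for N = 1.
    """
    sum_merge = sum_meas = sum_pim = sum_intensity = 0.0
    for intensities in observations.values():
        n = len(intensities)
        if n < 2:
            continue
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        sum_merge += abs_dev
        sum_meas += math.sqrt(n / (n - 1)) * abs_dev  # multiplicity-corrected
        sum_pim += math.sqrt(1 / (n - 1)) * abs_dev   # precision-indicating
        sum_intensity += sum(intensities)
    if sum_intensity == 0:
        raise ValueError("no reflections with at least two observations")
    return (
        sum_merge / sum_intensity,
        sum_meas / sum_intensity,
        sum_pim / sum_intensity,
    )


if __name__ == "__main__":
    # Hypothetical intensities for two reflections.
    toy = {(1, 0, 0): [100.0, 110.0], (0, 1, 0): [50.0, 50.0, 56.0]}
    r_merge, r_meas, r_pim = merging_r_factors(toy)
    print(f"Rmerge={r_merge:.4f} Rmeas={r_meas:.4f} Rpim={r_pim:.4f}")
```

Because the Rpim weight shrinks as multiplicity grows, Rpim falls below Rmerge for highly redundant data, which is consistent with the λ3 data set in Table 1 having the most observations and the lowest Rpim.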



has one bound Zn ion. In an earlier study, it was found 
that HmrA [PDB:3ram] [25], a Peptidase_M20 [Pfam: 
PF01546] protein (M20 and M28 peptidases are both in 
the MH clan and closely related to each other), also con- 
tained only one Zn ion and that this might have been 
enough to change its specificity from that of an exopeptid- 
ase (aminopeptidase or carboxypeptidase, which are the 
predominant specificities in both M20 and M28) to that 



of an endopeptidase. Despite only one Zn ion in HmrA (it 
is not fully clear whether the HmrA physiologically con- 
tains only one Zn ion or whether this was an artifact of 
the crystallization and that two Zn should be present), all 
five Zn-coordinating residues expected in Peptidase_M20 
are conserved, which is not the case with CA_C2195. In 
CA_C2195 only the residues that bind the single Zn ion 
have been retained.

Figure 1 Crystal structure and domain architecture. The crystal structure of CA_C2195 from Clostridium acetobutylicum, with the N- and C-
termini labeled as 'N' and 'C', reveals 3 domains: residues 1-55 (blue) and 165-355 (yellow) form the N-terminal metallopeptidase domain,
DUF4910; residues 56-164 (grey) form the DUF2172 domain; and residues 356-434 (red) form a C-terminal wHTH domain, HTH_47. Residues in
the putative active site are Asp195 (red stick) and His189 and His324 (cyan sticks), and they are bound to a Zn ion from the crystallization
condition. Imidazole from the crystallization condition is also bound to the active site Zn. The lower panel is a linear representation of the domain
architecture of CA_C2195.



CA_C2195 does not possess conventional Peptidase_M28 
active site residues, as both of the essential, invariant, active 
site residues have been replaced: Ser191 replaces the con-
served Asp and Pro225 replaces the conserved Glu. Ser191
is conserved as Ser in 73 of the 82 homologs that were 




Figure 2 Metallopeptidase domain structure. The metallopeptidase domain of CA_C2195 (blue) is similar in structure
to several other metallopeptidases, for example, the Peptidase_M28 family aminopeptidase [PDB:2dea] (orange), with
r.m.s.d. ~2.3 Å between Cα atoms over the entire domain despite a very low sequence identity of ~17%.



aligned and present as either Ala or Gly in the remaining 
9 homologs. Pro225 is conserved as Pro in 81 of the ho- 
mologs and present as Val in 1 homolog. All enzymes in 
Peptidase_M28, the closest known peptidase family by 
structure and sequence, have these residues conserved. 
There are over 550 non-peptidase M28 homologs in
MEROPS, but only a few have been characterized. Those 
that have been characterized have evolved different 
functions, for example, the transferrin receptor proteins 
1 and 2, and glutaminyl-peptide cyclotransferase. The 
glutaminyl-peptide cyclotransferase also has all five 
Zn-binding and both active site Asp and Glu residues 
conserved [26]; therefore, CA_C2195 is unlikely to have
comparable catalytic activity. Transferrin in blood serum
binds iron, which is internalized once transferrin docks to
its receptor [27].

Insert domain (DUF2172) 

Residues 56-164 (approximately) in CA_C2195 form a 
separate globular domain inserted into the DUF4910 do-
main. This insert domain adopts an α+β fold that does
not closely match any other known structures. However, 
careful visual inspection shows (Figure 4) that the insert 
domain bears a resemblance to the "Protease-associated" 
domain (PA domain, [Pfam:PF02225]) in terms of gross 
structure and orientation of insertion. A comparison of 
the CA_C2195 structure with the structure of an amino-
peptidase from Aneurinibacillus sp. strain AM-1 [PDB:

Figure 3 Residue conservation analysis in the metallopeptidase domain. The residues likely involved in activity are Asp195, His189 and
His324 and have the highest conservation (dark pink, scale 9 in a range of 1 to 9 in CONSURF) across CA_C2195 homologs. The presence of
other highly conserved residues around the putative active site suggests that they will also be involved in function. The least conserved residues
(cyan, scale 1) in CA_C2195 are also visible.



2ek8] suggests that its DUF2172 domain is very likely
derived from the PA protein domain family (Figure 4). 
The PA domain is similarly found inserted within several 
other peptidase domains, which are catalytically unre- 
lated to each other. Interestingly, the PA domain is 
found inserted in some Peptidase_M28 domains at a 
structurally equivalent site to that of DUF2172 in 
DUF4910. It has been suggested that the PA domain 
may act as a lid, which covers the active site and may be 
involved in protein recognition in vacuolar sorting re- 
ceptors [28]. The PA domain of aminopeptidase has a 
characteristic "swivelling" β/β/α domain fold [24]. In the
DUF2172 domain in CA_C2195, there is a turn of an α-
helix instead of a large β-α-β-α-β substructure on one
side of the PA domain fold, whereas the remaining
structures of the two domains retain overall similarity
and differ only by a few minor insertions or deletions
(Figure 4). Given their equivalent location relative to the 
peptidase domain, we propose that the DUF2172 do- 
main has probably evolved from the PA domain in a 
pre-existing multi-domain context, that is, after its mer- 
ger with the catalytic domain. 

To study sequence conservation in DUF2172 homo- 
logs, thereby allowing the identification of residues that 
may be functionally important, 80 sequences ranging in 
identity from 47-66% were aligned and the conservation 



profile used to mark-up the structure corresponding to 
DUF2172 (Figure 5). Numerous aromatic amino acid 
residues appear to be the most conserved in this do-
main: Trp70, Tyr98, Tyr127, Tyr131 and Tyr132. Specu-
latively, these residues might be important in binding to
target proteins if, like the PA domain, this domain is in- 
volved in protein recognition. 

C-terminal wHTH domain (HTH_47) 

One of the most interesting aspects of CA_C2195 and its 
homologs is the presence of a unique C-terminal circularly 
permuted wHTH domain in conjunction with the metallo- 
peptidase domain. A search for other proteins using Fold 
that are similar to this domain (residues 356-434) results 
in very significant hits (SSM Q-score ~0.4, r.m.s.d. ~2.0 Å
between Cα atoms over the entire domain) with other
wHTH domains, although the sequence identities of
these hits are in the 15-19% range (the PDB codes of
the top 4 hits are: [PDB:2xvc], [PDB:2yu3], [PDB:1cf7],
[PDB:3o6b]). A Jackhmmer [29] search using default 
search parameters identifies matches on the third iter- 
ation to sequences corresponding to the position of 
MarR_2 [Pfam:PF12802] transcription factors. Struc- 
tures of sequences belonging to MarR_2 also adopt a 
wHTH topology, supporting the structure-based search 
at the sequence level, but clearly show that this wHTH 

Figure 4 Comparison of the DUF2172 and PA domains. (A) The DUF2172 domain in CA_C2195 (grey, left panel) bears some fold
resemblance to the PA (Protease-associated) domain (grey, right panel), which has been observed in a Peptidase_M28 family member ([PDB:2ek8],
right panel) even though there is no discernible sequence identity. Analogous to the proposed role of the PA domain, the DUF2172 domain may
be forming a lid modulating access to the peptidase active site and may also be involved in substrate recognition and specificity. Molecules in
the panels are oriented such that the peptidase domains in both superimpose. The active sites in both molecules are shown in cyan sticks and
black spheres. (B) A large substructure of the PA domain fold (yellow, left panel) is replaced with a turn of α-helix in DUF2172 (orange,
right panel).



has diverged in terms of sequence from other known
wHTH domains. To identify residues that may be func- 
tionally important based on sequence conservation, 43 
homologs ranging in sequence identity from 36%-79% 
were used, out of which only one sequence had higher 
than 53% sequence identity (Figure 6). This revealed 
that residues with the highest conservation are surface 
exposed in this domain, suggesting that their role may 
be in surface-mediated contacts. 

The juxtaposition of a metallopeptidase with a wHTH 
domain is not common, although a similar domain 



architecture has been observed previously in methionine 
aminopeptidase-2 (Met-AP2). The wHTH domain in 
Met-AP2 is inserted within a distinct peptidase domain 
belonging to the Peptidase_M24 family [Pfam:PF00557], 
which includes the creatinases and prolidases. In Met- 
AP2, the inserted wHTH domain has been shown to be 
important for the recognition and specificity of the sub- 
strate, namely, the amino-termini of proteins processed 
by the enzyme [30] [PDB:1boa]. Interestingly, compari-
son of the CA_C2195 and Met-AP2 wHTH domains in-
dicates that they have a similar permutation of the

Figure 5 Residue conservation analysis in the DUF2172 domain. The presence of highly conserved aromatic residues (dark pink) including
Trp70, Tyr98, Tyr127, Tyr131 and Tyr132, indicates residues that may be involved in substrate recognition if this domain has a functionality
associated with substrate interactions.



wHTH domain (Figure 7). Furthermore, as is the case in
Met-AP2, the CA_C2195 wHTH domain is spatially
located as a distinct module, which points away from the 
core catalytic domain. Thus, by analogy to the Met-AP2, 
we propose that the permuted wHTH might serve in a 
similar capacity in substrate recognition and specificity in 
CA_C2195 and its homologs. In a more general sense, the 
recognition of circularly permuted domains independently 
fused to two distinct classes of peptidases raises the possi- 
bility that these domains may have been more generally 
recruited as potential peptide-recognition modules early 
in the history of proteins. 

Oligomeric assembly 

As mentioned above, crystal packing analysis predicts a 
trimer as the oligomeric form in solution, which is sup- 
ported by size-exclusion chromatography coupled with 
static light scattering. The trimeric assembly is formed 
by the interaction of residues in the wHTH domain 
(loop residues 362-368 and helix residues 389-393) with 
loop residues 62-64 in the insert domain and loop resi- 
dues 302-305 and 293-294 in the metallopeptidase-like 
domain. Some of these residues forming the assembly in 
all 3 domains show high conservation, indicating that 
these are likely to be the key binding residues in the pro- 
tein interaction interface. In particular, a substantial por- 
tion of the surface on one side of the wHTH appears to 
be responsible for mediating the monomer protein 



interactions in the oligomeric state, covering the major- 
ity of the highly conserved residues. These observations 
strongly suggest that the wHTH functions in mediating 
protein interactions in the oligomeric state. 

Conserved gene neighborhoods point to a potential role 
in modified carbohydrate biosynthesis 

As described above, the sequence and structural analysis in- 
dicates that the conserved residue pattern does not con- 
form to any known peptidase active site. Therefore, to 
better understand the possible biochemical function of 
CA_C2195, we used contextual information gleaned from 
conserved gene neighborhoods. Several studies have 
shown that genome context or conserved gene- 
neighborhoods provide information in terms of func- 
tionally interacting partners or complexes to which 
particular proteins belong [31-33]. Interestingly, we 
found a strong gene-neighborhood association (and in 
some cases gene fusions) between CA_C2195 and its 
homologs with several genes involved in biosynthesis of 
a modified carbohydrate across several phylogenetically 
distinct bacterial taxa, namely actinobacteria, firmi- 
cutes, cyanobacteria, bacteroidetes, planctomycetes 
(Table 2, Additional file 1, Additional file 2). This wide 
phyletic spread of the association suggests that the co- 
occurrence is likely to be of functional importance for 
these enzymes. Among the strongly linked genes we 
found those coding for a sugar epimerase/dehydratase,

Figure 6 Residue conservation analysis in C-terminal wHTH domain. Residues in the C-terminal circularly permuted wHTH domain that
might be involved in substrate recognition and specificity based on their high conservation across CA_C2195 homologs (residues with highest
conservation are in dark pink) are visualized.



a sugar phosphate nucleotidyltransferase, a glycosyl trans-
ferase, an aminosugar N-acetyltransferase and a SAM-
dependent sugar methylase. These enzymes are all
associated with carbohydrate metabolism, and are in- 
dicative that a modified sugar is being synthesized by 
the action of multiple enzymes and converted to a 
nucleotide diphosphate linked sugar by the action of 
the nucleotidyltransferase. This NDP-sugar then prob- 
ably serves as the substrate for the glycosyltransferase 
that transfers it to a target moiety. However, examin- 
ation of the predicted operons also reveals variability 
especially in terms of the numbers of genes encoding 
for glycosyltransferases, sugar methylases and other 
auxiliary modifying enzymes such as those that act on 
sugars to add acyl groups (Table 2, Additional file 1, 
Additional file 2). 
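
The neighborhood extraction described in Methods (20 genes on each side of an anchor gene, followed by a same-strand check before genes are counted as one operon) can be sketched as follows. The original analysis used a custom Perl script that is not reproduced here; this Python version is only an illustration. The `Gene` record, the function names and the `max_gap` threshold of 150 bp are assumptions, since the text gives no numeric criterion for a shared promoter, and the toy loci and coordinates are hypothetical. Circular genomes are not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    locus: str
    start: int   # leftmost coordinate on the replicon
    end: int     # rightmost coordinate on the replicon
    strand: str  # "+" or "-"


def neighbors(genes: List[Gene], anchor_locus: str, flank: int = 20) -> List[Gene]:
    """Return the anchor plus up to `flank` genes on each side, in genome order."""
    ordered = sorted(genes, key=lambda g: g.start)
    idx = next((i for i, g in enumerate(ordered) if g.locus == anchor_locus), None)
    if idx is None:
        raise KeyError(anchor_locus)
    return ordered[max(0, idx - flank): idx + flank + 1]


def same_strand_run(window: List[Gene], anchor_locus: str, max_gap: int = 150) -> List[Gene]:
    """Extend from the anchor while neighbors share its strand and lie close by.

    `max_gap` (bp) is an assumed stand-in for "shares a putative common promoter".
    """
    idx = next(i for i, g in enumerate(window) if g.locus == anchor_locus)
    strand = window[idx].strand
    lo = idx
    while (lo > 0 and window[lo - 1].strand == strand
           and window[lo].start - window[lo - 1].end <= max_gap):
        lo -= 1
    hi = idx
    while (hi < len(window) - 1 and window[hi + 1].strand == strand
           and window[hi + 1].start - window[hi].end <= max_gap):
        hi += 1
    return window[lo: hi + 1]


if __name__ == "__main__":
    # Hypothetical toy replicon with five genes.
    toy = [
        Gene("g1", 0, 900, "+"),
        Gene("g2", 950, 1800, "+"),
        Gene("g3", 1850, 2700, "+"),
        Gene("g4", 2750, 3600, "-"),
        Gene("g5", 3650, 4500, "+"),
    ]
    window = neighbors(toy, "g3")
    print([g.locus for g in same_strand_run(window, "g3")])
```

Separating window extraction from the operon test keeps the second step easy to replace, for example with a check for divergent, head-to-head gene pairs.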

This linkage between a gene coding for a peptidase-like 
protein with a carbohydrate biosynthetic system could be 
explained in at least three alternative ways: 1) CA_C2195 
protein and its homologs are post-translationally glycosyl- 
ated; 2) The DUF4910 domain cleaves target proteins 
alongside their modification by glycosylation; 3) The 
DUF4910 domain actually participates in the biosynthesis 
of a sugar-derived metabolite by catalyzing a reaction bio- 
chemically distinct from the classical peptidase reaction. 



Circumstantial evidence supports the third alternative. 
First, as discussed above, the CA_C2195-like genes do not 
seem to preserve the conventional metallopeptidase active 
site. Moreover, these genes are usually embedded in the 
middle of an operon with genes for carbohydrate-
modifying enzymes on either side. Second, these operons
do not show any linked genes coding for other potential 
target proteins. Third, in several cases these operons con- 
tain genes for a transmembrane carbohydrate export pro- 
tein (related to the O-antigen and teichoic acid export 
proteins) and transmembrane sugar pyruvyltransferase 
(Table 2, Additional file 1, Additional file 2). These pro- 
teins suggest that the modified carbohydrate is unlikely to 
be used to modify intracellular proteins; rather it is likely 
to be translocated to the cell-surface and used as part of a 
surface polysaccharide/lipopolysaccharide. In light of these 
observations it is possible that DUF4910 is involved in 
modification of the sugar-derived metabolites, perhaps via 
transacylation of a peptide/glutamine to an amino sugar. 
In principle, they could also be used in an amidase reac- 
tion for deacylation of a sugar amide, but this would imply 
that they utilize distinctive active site residues (see above). 
TMPRED (www.ch.embnet.org/software/TMPRED_form. 
html) predicts one significant transmembrane helix in 
CA_C2195 (residues 192-213, inside to outside, score

Figure 7 Comparison of wHTH domains. (A) The circularly permuted wHTH domain observed in CA_C2195 (red, left panel) resembles another
circularly permuted wHTH domain present in the structure of a Peptidase_M24 family aminopeptidase [PDB:1boa] (red, right panel), and may be
involved in substrate recognition and specificity. (B) The wHTH domain in CA_C2195 (left) is compared to the wHTH domain from
Peptidase_M24 [PDB:1boa] (center) and a wHTH domain from a transcription factor [PDB:1cf7] (right), which was one of the proteins most similar
in structure to the CA_C2195 wHTH domain. Each domain is colored from the N-terminus (blue) to the C-terminus (red). All domains are in a
similar orientation. (C) Topology diagrams for the three domains in (B) in the same order depicting the arrangement of secondary structure
elements and circular permutation in the CA_C2195 wHTH compared to the transcription factor wHTH. Cylinders represent α-helices, arrows
represent β-strands and the N- and C-termini are labeled.



557), which is buried in the metallopeptidase-like domain 
(and therefore incorrectly predicted to be transmem- 
brane), and Phobius [34] predicts most of the protein to 
be extracellular, with a dip where the possible transmem-
brane helix might be. SignalP [35] fails to predict a signal
peptide and so it is unknown how this protein gets into 
the periplasm or if it is extracellular. 



Conclusions 

The crystal structure of CA_C2195 and subsequent
sequence-structure-function analysis show that CA_C2195
(and ~200 homologs, ranging in sequence identity from 40-
60%) is a three-domain protein, which includes a C-
terminal wHTH domain and a DUF2172 domain inserted
in the DUF4910 metallopeptidase-like domain. The

Table 2 Gene neighborhood analysis

GI number   Gene   Locus      Protein        Product
15895455           CA_C2186   NP_348804.1    Glycosyltransferase
15895456    spsE   CA_C2187   NP_348805.1    N-acetylneuraminic acid synthase + SAF sugar-binding (condenses phosphoenolpyruvate and N-acetylmannosamine)
15895457           CA_C2188   NP_348806.1    Glycosyltransferase
15895458           CA_C2189   NP_348807.1    ATP-grasp amino acyl ligase
15895459    spsF   CA_C2190   NP_348808.1    Sugar phosphate nucleotidyltransferase
15895460           CA_C2192   NP_348809.1    Glyoxylase
15895461           CA_C2193   NP_348810.1    DUF3880 + Glycosyltransferase
15895462           CA_C2194   NP_348811.1    Nucleoside-diphosphate sugar epimerase
15895463           CA_C2195   NP_348812.1    Peptidase-like (peptidase_MH superfamily)
15895464           CA_C2196   NP_348813.1    Methyltransferase + Glycosyltransferase (currently annotated as: MAF_flag10, DUF115)
15895465           CA_C2197   NP_348814.1    Aminosugar N-acetyltransferase
15895466    acpA   CA_C2198   NP_348815.1    Acyl carrier protein
15895467           CA_C2199   NP_348816.1    Aminosugar N-acetyltransferase + HAD phosphatase




presence of the PA domain-like DUF2172 domain shows 
similarity in domain architecture to some members of the 
Peptidase_M28 family [PDB:2ek8]. However, the presence
of a C-terminal wHTH domain in CA_C2195 shows simi-
larity to domain architectures found in Peptidase_M24
[PDB:1boa]. Analysis of sequence conservation reveals a
cluster of non-sequential, highly conserved residues on the 
surface of the structure of CA_C2195, which are likely to 
be functionally important, some of which in the wHTH are 
involved in forming the protein interaction interface in the 
oligomeric form. It is possible that these proteins do not 
have any metallopeptidase activity because of the absence 
of all the catalytic residues that are expected from other 
characterized members of this peptidase clan. Based on 
gene neighborhood analysis, we propose that CA_C2195 
and its homologs could be involved in the biosynthesis of 
modified carbohydrates. Given the importance of cell sur- 
face polysaccharides in inter-organismal interactions, fur- 
ther characterization of the biochemical activity of this 
protein is likely to be of interest in the case of pathogens 
that encode a CA_C2195-like gene, such as Brucella and
Campylobacter.

Methods 

Protein production and crystallization of CA_C2195 was 
carried out by standard JCSG protocols [36-38]. Data col- 
lection was performed at SSRL beamline 9-2. The crystal 
structure was determined by MAD phasing using a seleno- 
methionine-derivatized protein. X-ray data collection, pro- 
cessing, structure solution, tracing, crystallographic refine- 
ment and model building were performed using BLU-ICE 
[39], MOSFLM [40]/SCALA [41], SHELXD [42] /AUTO- 
SHARP [43], ARP/wARP [44], REFMAC [45] and COOT 
[46]. To find homologs for sequence conservation analysis,
PSI-BLAST was used to search the Uniref90 database in 3
iterations with an e-value cutoff of 0.0001, searching for a
maximum of 150 homologs between 35-95% sequence identity, using
MAFFT as the alignment method, the Bayesian calcu-
lation method, and the JTT evolutionary substitution model,
as implemented in CONSURF [47]. Figure 2 was prepared
using Chimera (http://www.cgl.ucsf.edu/chimera) and all 
others were prepared using PyMOL [48]. The topology dia- 
grams in Figure 7C are from PDBsum [49]. Gene neighbor- 
hood was comprehensively analyzed using a custom Perl 
script using the CA_C2195 gene or its homologs as anchors.
This script uses either the PTT file (downloadable from the 
NCBI ftp site) or the Genbank file in the case of whole gen- 
ome shot gun sequences to extract 20 gene neighbors on 
the 3' and 5' sides of a given query gene. The protein se- 
quences of all neighbors were clustered using the BLAS- 
TCLUST program (ftp://ftp.ncbi.nih.gov/blast/documents/ 
blastclust.html) to identify related sequences in gene neigh-
borhoods. Each cluster of homologous proteins was then
assigned an annotation based on the domain architecture
or conserved shared domain, which were detected using
Pfam models and in-house profiles run using RPS-BLAST
[50]. This allowed an initial annotation of gene neighbor- 
hoods and their grouping based on conservation of neigh- 
borhood associations. In further analysis, care was taken to 
ensure that genes are unidirectional on the same strand of 
DNA and shared a putative common promoter to be 
counted as a single operon. If they were head to head on 
opposite strands, they were examined for potential bidirec-
tional promoter sharing patterns. A total of 4789 representa-
tive bacterial and archaeal genomes were analyzed for the
detection of CA_C2195 orthologs. These genomes spanned 
representatives of all currently known major lineages of 
bacteria and archaea. From these, 229 genomes were
identified as having CA_C2195 orthologs with gene neigh-
borhoods and further analysis was performed on this subset
of genomes. Within this subset conserved gene neighbor- 
hood associations were detected in 10 major bacterial 
clades namely actinobacteria, firmicutes, cyanobacteria, 
planctomycetes, bacteroidetes, nitrospirae, alphaproteobac-
teria, betaproteobacteria, epsilonproteobacteria and spiro-
chaetes. Using a simulation with sampling without
replacement and an average genome size of 4000 genes, we
found that the probability of such genes as described above coming together
by chance alone in such neighborhoods was p < 10"^. For 
all bioinformatics analyses that were performed using ho- 
mologs within a family for comparison, the chosen se- 
quences were well over the inclusion threshold for the 
family as built. 
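
The chance-association estimate mentioned above can be illustrated with a small simulation. The sketch below is not the authors' simulation and does not attempt to reproduce the reported significance value. It only shows the principle of drawing a neighborhood without replacement from a genome of 4000 genes and asking how often a given number of "marked" genes land in it, alongside the exact hypergeometric probability for comparison. The window of 40 genes follows from the 20 neighbors taken on each side; the number of marked genes, the hit threshold, the trial count and the function names are hypothetical.

```python
import random
from math import comb


def hypergeom_tail(genome_size: int, n_marked: int, window: int, min_hits: int) -> float:
    """Exact P(at least `min_hits` marked genes among `window` genes drawn
    without replacement from a genome containing `n_marked` marked genes)."""
    total = comb(genome_size, window)
    upper = min(n_marked, window)
    favorable = sum(
        comb(n_marked, k) * comb(genome_size - n_marked, window - k)
        for k in range(min_hits, upper + 1)
    )
    return favorable / total


def simulate_tail(genome_size: int, n_marked: int, window: int, min_hits: int,
                  trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # random.sample draws without replacement; genes 0..n_marked-1 are "marked".
        drawn = rng.sample(range(genome_size), window)
        if sum(1 for g in drawn if g < n_marked) >= min_hits:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical scenario: 5 marked genes, 40-gene window, at least 1 hit.
    print("exact    :", hypergeom_tail(4000, 5, 40, 1))
    print("simulated:", simulate_tail(4000, 5, 40, 1))
```

For very small probabilities a direct simulation needs an impractically large number of trials, so the exact expression is the more useful of the two in that regime.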

Availability of supporting data 

Atomic coordinates and experimental structure factors 
for CA_C2195 have been deposited in the Protein Data 
Bank (www.wwpdb.org) with PDB accession code 3k9t 
(DOI:10.2210/pdb3k9t/pdb). 